9 research outputs found
Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation
Sequence-to-Sequence (S2S) models recently started to show state-of-the-art
performance for automatic speech recognition (ASR). With these large and deep
models overfitting remains the largest problem, outweighing performance
improvements that can be obtained from better architectures. One solution to
the overfitting problem is increasing the amount of available training data and
the variety exhibited by the training data with the help of data augmentation.
In this paper we examine the influence of three data augmentation methods on
the performance of two S2S model architectures. One of the data augmentation
method comes from literature, while two other methods are our own development -
a time perturbation in the frequency domain and sub-sequence sampling. Our
experiments on Switchboard and Fisher data show state-of-the-art performance
for S2S models that are trained solely on the speech training data and do not
use additional text data.Comment: To appear in ICASSP 202
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are
capable of directly mapping speech inputs to transcriptions or translations.
However, the mechanism for modeling positions in this model was tailored for
text modeling, and thus is less ideal for acoustic inputs. In this work, we
adapt the relative position encoding scheme to the Speech Transformer, where
the key addition is relative distance between input states in the
self-attention network. As a result, the network can better adapt to the
variable distributions present in speech data. Our experiments show that our
resulting model achieves the best recognition result on the Switchboard
benchmark in the non-augmentation condition, and the best published result in
the MuST-C speech translation benchmark. We also show that this model is able
to better utilize synthetic data than the Transformer, and adapts better to
variable sentence segmentation quality for speech translation.Comment: Submitted to Interspeech 202
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring
the computational and scientific issues surrounding the discovery of linguistic
units (subwords and words) in a language without orthography. We study the
replacement of orthographic transcriptions by images and/or translated text in
a well-resourced language to help unsupervised discovery from raw speech.Comment: Accepted to ICASSP 201
Speech technology for unwritten languages
Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this article takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.Multimedia Computin
Computational language documentation: some results from the BULB project
International audienceabstrac